Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition in MessageFilter #538

Closed
wants to merge 9 commits into from

Conversation

rhaschke
Copy link
Contributor

@rhaschke rhaschke commented Jul 1, 2022

A bug reported in ros-visualization/rviz#1753 pointed to race condition(s) in tf2_ros::MessageFilter.
I implemented a stress test to analyze the problem in detail. There were two issues:

  1. A race condition routing back to deadlock #91, switch to use a shared lock with upgrade instead of only a unique lock for message_mutex_ #101, Solve a bug that causes a deadlock in MessageFilter #144
    As pointed out in Solve a bug that causes a deadlock in MessageFilter #144 the deadlock arises due to an inversion of locking order:
    • MessageFilter::add(), when removing the oldest message, locks:
      • MessageFilter::messages_mutex_
      • BufferCore::transformable_requests_mutex_ (via cancelTransformableRequest())
    • BufferCore::testTransformableRequests() locks:
      • transformable_requests_mutex_
      • MessageFilter::message_mutex_ (via MessageFilter::transformable() callback)
  2. A race condition with CBQueueCallback
    When MessageFilter::signalMessage() is called asynchronously via a ROS callback queue, this call might still be in progress (holding Signal1::mutex_), while another thread might attempt to remove the MessageFilter.
    This results in a failed assertion as the MessageFilter's destructor attempts to destroy a locked mutex.

PR #144 attempted to resolve the first issue by postponing callback calls. However, this has the drawback of using references to TransformableRequests outside the protection by transformable_requests_mutex_. Thus, these requests might get deleted by another thread, resulting in a segfault as revealed by the stress test.
Here the deadlock is resolved in MessageFilter::add, canceling the requests outside the messages_mutex_ protection.

To resolve the second race condition, another mutex is introduced in MessageFilter, to ensure that CBQueueCallback::call() is finished before destroying the MessageFilter.

These two changes make the stress test pass.
@tfoote, please note that this is a PR against Melodic. I will prepare a port for Noetic as well.

rhaschke added 7 commits July 1, 2022 15:16
... revealing that propagating messages via a ROS callback queue is inherently unsafe.
If the filter is going to be destroyed, a CBQueueCallback::call() might still be in action,
locking the underlying Signal1's mutex_.
This will cause as assertion failure during destruction of the mutex:
boost::mutex::~mutex(): Assertion `!posix::pthread_mutex_destroy(&m)' failed.
When MessageFilter::signalMessage() is called asynchronouyls via a ROS callback queue,
this call might still be in progress while another (holding Signal1::mutex_), while
another thread might attempt to remove the MessageFilter.
This results in a failed assertion as the MessageFilter destructor attempts to destroy a locked mutex.

Here, by introducing another mutex in MessageFilter, we ensure that CBQueueCallback::call() is finished
before destroying the MessageFilter.
PR ros#144 attempted to resolve a deadlock by postponing calls to callbacks.
However, this has the drawback of using references to TransformableRequests -
outside the protection by transformable_requests_mutex_.
Thus, these requests might get deleted by another thread, resulting in segfaults:

terminate called after throwing an instance of 'boost::wrapexcept<boost::lock_error>'
  what():  boost: mutex lock failed in pthread_mutex_lock: Invalid argument

This reverts commit bfb8038.
As pointed out in ros#144 the deadlock arises due to an inversion of locking order:
- MessageFilter::add(), when removing the oldest message, locks:
  - MessageFilter::messages_mutex_
  - BufferCore::transformable_requests_mutex_ (via cancelTransformableRequest())
- BufferCore::testTransformableRequests() locks:
  - transformable_requests_mutex_
  - MessageFilter::message_mutex_ (via MessageFilter::transformable() callback)

PR ros#144 attempted to resolve the issue by postponing callback calls.
However, this has the drawback of using references to TransformableRequests -
outside the protection by transformable_requests_mutex_.
These requests might got deleted by another thread meanwhile!

Here the deadlock is resolved in MessageFilter::add, cancelling the requests
outside the messages_mutex_ protection.
@rhaschke rhaschke requested a review from tfoote as a code owner July 1, 2022 13:47
@rhaschke rhaschke marked this pull request as draft July 1, 2022 14:09
@rhaschke
Copy link
Contributor Author

rhaschke commented Jul 1, 2022

Looks like there is still a deadlock 😞

@rhaschke
Copy link
Contributor Author

rhaschke commented Jul 1, 2022

Looks like there is still a deadlock

Resolved as well.
But I still see occassionally a race condition when using a ROS callback queue to signal messages, causing the assertion failure of issue (2).

@rhaschke rhaschke marked this pull request as ready for review July 1, 2022 15:58
While MessageFilter::clear() removes pending messages from the callback queue,
there might be still callbacks active when attempting to remove the MessageFilter.
Depending on what goes first, either the already destroyed mutex is tried to lock
or a locked mutex is tried to destroy, both resulting in an assertion failure.

This race condition cannot completely avoided.
Using a shared mutex for all callers and moving the unique lock to the end
of the destructor hopefully reduces the probability of such a race condition.
@rhaschke
Copy link
Contributor Author

rhaschke commented Jul 1, 2022

I'm afraid the remaining race condition cannot be fixed. Rather, the ROS callback queue mechanism should be avoided.

While MessageFilter::clear() removes pending messages from the callback queue, there might be still callbacks active when attempting to remove the MessageFilter. Depending on what goes first, either the already destroyed mutex is tried to lock
or a locked mutex is tried to destroy, both resulting in an assertion failure.

@ahcorde
Copy link
Contributor

ahcorde commented Feb 2, 2024

This PR targets melodic which is EOL, closing this PR. There is already another PR targeting noetic #539.

@ahcorde ahcorde closed this Feb 2, 2024
@rhaschke
Copy link
Contributor Author

rhaschke commented Feb 2, 2024

I'm fine that you closed this PR, @ahcorde. But it is a pity that this PR was lying around for 19 months w/o any feedback and now being closed due to EOL reasons. I hope the Noetic port (#539) will be considered at some point 😉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants